-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG/TST: Fixes isnull behavior on NaT in array. Closes #5443 #5524
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
this needs to be profiled via test_perf.sh |
I ran test_perf.sh on the patch. Here is the vb_log: https://gist.github.com/commonlisp/7488695 |
@@ -213,7 +213,7 @@ def isnullobj(ndarray[object] arr): | |||
n = len(arr) | |||
result = np.zeros(n, dtype=np.uint8) | |||
for i from 0 <= i < n: | |||
result[i] = util._checknull(arr[i]) | |||
result[i] = util._checknull(arr[i]) or arr[i] is NaT |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should be in util._checknull
itself
Ah, yes, I did have the check in util._checknull at first, but this creates a circular dependency between util and tslib where NaT is defined. Is that ok? Or perhaps NaT should be factored out. |
actually...this whole change is trivial....just change |
Great call. Using checknull now and added the new explicit test case. Here is a fresh test_perf.sh: https://gist.github.com/commonlisp/7492647. Going to run it again just to make sure there aren't any aberrations. |
I added a gist https://gist.github.com/commonlisp/7547232 with another run of test_perf.sh. It appears to agree with the previous run. |
The test_perf.sh runs for modify isnullobj directly and using checknull do differ somewhat (1.22 vs. 1.87). The question is whether this is within test_perf.sh's margin of error. |
that's definitly out of bounds, pls see if its something simple.. |
I've run a good number of performance tests. It seems that util.is_float_object, util.is_complex_object, util.is_datetime64_object, and util.is_timedelta64_object, though declared inline functions in numpy_helper.h, are contributing considerably to the runtime (~1.26-1.5x individually). We could factor out yet another version of lib.checknull that skips those checks to support the lib.isnullobj case. |
ok I guess having a special version of checknull_object might be useful for object arrays where u can skip certain checks (but still need some that _checknull does not do) |
Any reason not to get this merged in for v0.13? |
@cancan101 have to nail down the perf issue, this code is used everywhere |
That makes sense. I wasn't sure if a perf fix was worth waiting for. |
New test_perf.sh for this factoring of lib.checknull: https://gist.github.com/anonymous/8017235 |
@@ -170,20 +170,23 @@ cdef inline int64_t get_timedelta64_value(val): | |||
cdef double INF = <double> np.inf | |||
cdef double NEGINF = -INF | |||
|
|||
cpdef checknull_NaT(object val): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this doesn't need to be cpdef (only cdef)
make it _checknull_NaT
as well
make it inline and I don't think you need util.
when calling _checknull
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
btw - sometimes it can actually be faster to manually inline the function
(via copy/paste) vs. using the inline keyword. It's strange but if you
can't pull out more perf, might be helpful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jtratner is right...just put it in 2 places is ok (but need to avoid the arr[i]
twice...
@commonlisp I know this is tedious..but this type of low-level coding affects so many things....
I think this is ready for review again. Please let me know if anyone has feedback. |
can you squash the commits: https://github.com/pydata/pandas/wiki/Using-Git and post a perf run...just to check thanks |
checkout this: https://github.com/pydata/pandas/wiki/Using-Git you don't want to merge in other commits (nor merges) so then
|
Commits squashed, not merging any other commits |
you need to take out all but your commit |
do this:
then force push this will make it build again then'll I merge it.... |
common._isnull_ndarraylike(...) uses lib.isnullobj to check nulls/NaN/NaT in ndarray, which in turn relies on util._checknull. _checknull did not know about NaT, but now lib.isnullobj does, while still maintaining performance by doing arr[i] only once. Added a test case test_isnull_nat() to test_common.py and check for NaT in lib.isnullobj. pd.isnull(np.array([pd.NaT])) now yields the correct results ([True]).
@commonlisp first time working thru the process is painful......next time should be easier! |
@jreback Thank you! |
I added a test case test_isnull_nat() to test_common.py and a check for NaT in lib.isnullobj. pd.isnull(np.array([pd.NaT])) now yields the correct results ([True]).
closes #5443